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Density Functional Theory calculations traditionally suffer from an inherent cubic scaling with 
respect to the size of the system, making big calculations extremely expensive. This cubic scaling 
can be avoided by the use of so-called linear scaling algorithms, which have been developed during 
the last few decades. In this way it becomes possible to perform ab-initio calculations for several 
tens of thousands of atoms or even more within a reasonable time frame. However, even though the 
use of linear scaling algorithms is physically well justified, their implementation often introduces 
some small errors. Consequently most implementations offering such a linear complexity either 
yield only a limited accuracy or, if one wants to go beyond this restriction, require a tedious fine 
tuning of many parameters. In our linear scaling approach within the BigDFT package, we were 
able to overcome this restriction. Using an ansatz based on localized support functions expressed 
in an underlying Daubechies wavelet basis - which offers ideal properties for accurate linear scaling 
calculations - we obtain an amazingly high accuracy and a universal applicability while still keeping 
the possibility of simulating large systems with only a moderate demand of computing resources. 


I. INTRODUCTION 

The Kohn-Sham formalism of Density Functional The¬ 
ory (DFT) 1 2 has established itself as one of the most 
powerful electronic structure methods due to its good 
balance between accuracy and speed and is thus popular 
in various fields such as physics, chemistry, biology, and 
material sciences. Kohn-Sham DFT maps the problem 
of interacting electrons onto a problem of non-interacting 
quasi-electrons: given a system containing N electrons, 
the 3 N -dimensional many-electron wave function is as¬ 
sumed to be given by a single Slater determinant built 
from N orthonormal single particle orbitals ^(r) which 
a priori extend over the entire system. The drawback 
of this approach is that it has an inherent cubic scaling 
due to the orthonormality which is imposed on these so- 
called Kohn-Sham orbitals. To orthonormalize a set of N 
functions it is necessary to calculate the scalar products 
among all of them, which has the complexity 0(N 2 ). In 
addition the cost of calculating one scalar product is pro¬ 
portional to the size of the underlying basis, Since 

both N and are proportional to the total size of 

the system - typically indicated by the number of atoms 
n at - one ends up with a complexity 0(n 3 t ). This behav¬ 
ior is common to all programs using a systematic basis 
set, be it plane wave^P^ finite element# or wavelet d 7 . 
Even when not using a systematic basis set, e.g. in the 
case of GaussianP or atomic orbitald 9 , the complexity 
remains 0(n 3 t ) due to the matrix diagonalization which 
is required in a straightforward implementation. Due to 
this limitation this standard approach is only suitable 
for systems containing a few hundred or at most a few 
thousand atoms. 

In order to obtain an algorithm which can scale linearly 
with respect to the size of the system, one thus has to 
abandon the concept of the extended Kohn-Sham orbitals 


and work with other quantities which are strictly local¬ 
ized. One such quantity is the density matrix F(r, r'), 
which is related to the Kohn-Sham orbitals via F(r, r') = 
JT with fa being the occupation number 

of orbital i. For insulators or metals at finite tempera¬ 
ture it can be shown that the elements of F(r, r') decay 
exponentially with the distance between r and 
Consequently, if one neglects elements which are below a 
given threshold, the number of non-zero elements scales 
only linearly with respect to the size of the system, thus 
paving the way towards an algorithm with this reduced 
complexity. During the past decades this fact has led to 
a number of algorithms which are capable of performing 
linear scaling calculations; an overview is given in Ref.fTTl 

One particular approach is to write the density ma¬ 
trix in separable form by introducing a set of so-called 
support function d 18 * 19 ! an idea that has already been 
used in various codes such as ONETEFP ConqueslP, 
CP2KP and SIESTA.P These support functions can 
also be thought of as a localized basis to directly rep¬ 
resent the Kohn-Sham orbitals. Even if there exists an 
ideal set of localized functions - namely the maximally lo¬ 
calized Wannier functiond^ - this is of course not known 
beforehand. 

Another important aspect is the choice of the under¬ 
lying basis which is used to give a numerical representa¬ 
tion of the support functions. Ideally, as we would like 
to efficiently describe localized functions, such a basis set 
should at the same time feature orthogonality and com¬ 
pact support. This can indeed be offered by Daubechies 
wavelets^, making them an ideal basis set for linear scal¬ 
ing calculations. Moreover, wavelets are able to yield a 
reasonable precision with only moderate values for the 
numerical grid spacing, thus keeping the number of basis 
functions relatively small for a systematic approach. In 
addition the multiresolution properties of wavelets allow 
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for an adaptive resolution and thus to have a finer mesh 
only in those regions - close to the atoms - where it is 
actually required. 

As for a given system the localized Wannier functions 
are unknown, there are two possibilities: either use a 
larger number of support functions and hope that the 
most important features of the Wannier functions can be 
captured in this way, or to use a smaller number of sup¬ 
port functions and optimize them in situ (namely within 
a predefined localization region) to get as close as possible 
to the Wannier functions. We have recently implemented 
a linear scaling code based on Daubechies wavelets, dis¬ 
tributed in the BigDFT packag^^, where we decided to 
follow the second approach, in this way obtaining support 
functions which are optimally adapted to their chemical 
environment and thus yield a very high accuracy. Conse¬ 
quently we are in this way able to drastically reduce the 
size of the subspace in which the density matrix is repre¬ 
sented. Together with the aforementioned properties of 
Daubechies wavelets this leads to an optimally reduced 
number of degrees of freedom and enables us to perform 
calculations on thousands of atoms while only requiring 
moderate amounts of memory and CPU time. 

In this paper we will present the progress we have re¬ 
cently made for this linear scaling code. Thanks to the 
previously presented ingredients our code offers the abil¬ 
ity to perform DFT calculations which at the same time 
yield an astonishingly high accuracy, are universally ap¬ 
plicable, exhibit a linear complexity, and require only a 
very small amount of computational resources. In partic¬ 
ular it offers the advantage that one does not need to do a 
tedious fine tuning of various parameters and in particu¬ 
lar of the basis, but can straightforwardly get reasonable 
results using a set of default values. 

We will show that even with a relatively simple set of 
input parameters one can calculate energies with an ab¬ 
solute accuracy of the order of lOmeV/atom for a large 
variety of isolated systems. From a computational point 
of view, as the approach scales linearly with respect to 
system size, it is possible to use the concept of “CPU min¬ 
utes per atom” to evaluate the computational resources 
needed for a full DFT calculation. We will first present 
the basic principles of our approach and then demon¬ 
strate all of the aforementioned properties by various ex¬ 
amples. 


II. THEORY AND ALGORITHM 


A. Overview of the algorithm 


Using the formalism based on the density matrix 
F(r, r'), the band structure energy, which is the central 
quantity to be calculated, is given within the framework 
of Kohn-Sham DFT by the expression 


Ebs 


/ 


H(r')F(r,r') 


dr'. 


(i) 


Here 'H(r) is the Kohn-Sham Hamiltonian, defined by 

H( r) = - 1 V 2 + V K s(p( r), r) + Vpsp(r), ( 2 ) 

where V K sip{r),r) = V ext {v) + f |£^ydr' + V X c(r ) is 
the Kohn-Sham potential containing the external poten¬ 
tial, the Hartree potential and the exchange-correlation 
potential, and Vpsp is the pseudopotential representing 
the ions in the system. In BigDFT the pseudopotentials 
are norm-conserving GTH-HGhP 7 pseudopotentials and 
their Krack variants^, possibly enhanced by a nonlinear 
core correction 29 . 

The linear scaling version of BigDFT is based on a 
separable ansatz for the density matrix. Using so-called 
support functions <p a (r) and a density kernel K the den¬ 
sity matrix can be written as 

F(r, r') = ^2 (3) 

a,(3 

The charge density, which enters the Kohn-Sham Hamil¬ 
tonian, is given by the diagonal part of the density matrix 
and thus reads 


p(y) = U Ki*)K a f)<t>f){Y). (4) 

a,/3 

Using the orthonormality relations between the support 
functions and their duals, f 0 a (r)0^(r)dr = S a p, it fol¬ 
lows that the elements of the density kernel are given 
by 


H a/ 3 = JJ 4>a{r)F(r, r') 0 / 3 (r')drdr'. (5) 

Writing analogously the elements of the Hamiltonian ma¬ 
trix as 


H a p = J (f) a (r)H(r)(j)p(r')dr, (6) 

it follows that the band structure energy is given by 


E BS = tr(KH). 


( 7 ) 


The algorithm thus consists of two parts: finding a good 
set of support functions (fi a , which give rise to the Hamil¬ 
tonian matrix H, and determining the density kernel K in 
the dual (and never explicitly calculated) basis (r) . In 
order to reach linear scaling, the support functions must 
be localized and the density kernel sparse. This can be 
achieved by introducing cutoff radii around the center of 
each support function and setting to zero all components 
which lie outside of the region; for the density kernel an 
element K a p is set to zero if the centers of <fi a and are 
farther away than the sum of the kernel cutoffs of the re¬ 
gions a and /3 - these kernel cutoffs actually correspond 
to the cutoffs of the dual support functions (cf. Eq. [5| 
and are in general a bit larger than the support function 


cutoffs themselves, see the discussion in Sec. IIB The 





3 



FIG. 1. Flowchart of the linear scaling algorithm. It consists 
of a double loop structure: in the inner first loop the support 
functions are optimized with a fixed density kernel, in the 
second inner loop the density kernel is optimized using a fixed 
set of support functions. These two loops are then iterated in 
an outer loop until overall convergence is reached. 


support functions are again expanded in terms of an un¬ 
derlying wavelet basis, as is explained in more details in 
Ref. 26. 

A flowchart of the algorithm is shown in Fig. [l] It 
consists of a nested double loop structure, with one outer 
loop and two inner loops. In the first inner loop, the sup¬ 
port functions are optimized using a fixed kernel and thus 
also using a fixed potential and Hamiltonian. Therefore 
it is not advisable to perform too many iterations in this 
loop, but rather to exit even if the convergence threshold 
has not been reached yet. In the second inner loop, the 
density kernel is optimized using a fixed set of support 
functions. In contrast to the first inner loop, this second 
one is done self-consistently, i.e. the Hamiltonian is up¬ 
dated in each iteration. These two loops are then iterated 
in the outer loop until overall convergence is reached. 


B. Optimization of the support functions and 
density kernel 

The support functions are optimized by minimizing a 
target function. Ideally this target function should lead 
to strongly localized support functions which at the same 
time yield a very high accuracy. For the latter property 
the correct quantity to be minimized is the band struc¬ 
ture energy of Eq. 0 . For the first property the choice 
is not unique; one possibility would be 

n = J2(^a\n a \4>a), ( 8 ) 

a 

where is the Kohn-Sham Hamiltonian including a 
confinement, i.e. H a = l~i + c a (r — R a ) 4 , with R a being 
the center of the localization region. In order to combine 
the two properties we define our target function as 

n = J2 R aa (K\H a \K) + E K «e WMtp), (9) 

cn /3y£a 

where the prefactor for the confinement is smoothly re¬ 
duced during the run, as explained in more detail in 
Ref. [26] In this way we have a strong confinement in 
the beginning, leading to a decent localization, and still 
obtain a high precision as we correctly minimize the band 
structure energy in the end. Since the decrease of the con¬ 
finement is done automatically - taking into account the 
properties of the system -, this procedure is universally 
applicable without the need for any fine tuning. 

In contrast to other linear sca ling co des employing the 
same support function approach^®! we decided to im¬ 
pose an additional constraint and to keep the support 
functions approximately orthogonal. In other terms, we 
optimize (j) a such that the overlap matrix 

s a/ 9 = (<?y </>/?) (io) 

is close to the identity matrix. As the kernel cutoff is 
related to the extension of the dual support functions 
(see Eq.©), the sparsity of K is governed by the cut¬ 
off implicitly used for these dual functions. By choosing 
the support functions to be orthogonal the dual and non¬ 
dual entities are identical and the cutoffs for the support 
functions and the density kernel matrix elements can con¬ 
sequently be chosen along the same lines, leading to a 
degree of sparsity for the kernel which is comparable to 
that of the overlap matrix itself. 

In addition, quasi-orthogonal support functions lead 
to an overlap matrix which is close to the identity and 
whose inverse can thus be cheaply calculated using poly¬ 
nomial expansions. This property is in partic ular impor¬ 
tant for the Fermi Operator Expansion 30 ®! (FOE) pro¬ 
cedure that we use to determine the density kernel. The 
efficiency of this approach relies heavily on the sparsity 
of the matrix S' -1 / 2 HS -1 / 2 . Even if H is very sparse 
thanks to the cutoff radii of the support functions, this 
property is much less pronounced for S -1 / 2 in the case of 
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non-orthogonal support functions. Thus, if one wants to 
keep a high degree of sparsity also for the matrix prod¬ 
uct, it is indispensable to work with a set of orthogonal 
support functions. 

In spite of all these benefits, it must be stated that or¬ 
thogonality and locality are in general two contradicting 
properties; consequently a strict enforcement of the or¬ 
thogonality might lead to convergence problems. There¬ 
fore we only perform an explicit orthogonalization in the 
very beginning; in the following the orthogonality is only 
approximately conserved by the use of a Lagrange multi¬ 
plier in the support function gradient. This is enough to 
keep the overlap matrix diagonally dominant - the off- 
diagonal elements being typically of the order of 0.1 - 
and thus to maintain the aforementioned benefits. 

For the determination of the density kernel the code 
offers several possibilities, each one having its particular 
strengths and thus areas of application. In this paper we 
always used the FOE method, as it is the only method 
that allows to perform calculations for very large systems. 
For more details we again refer to Ref. [26] 

III. ASSESSMENT OF THE ACCURACY 

In order to assess the precision of our code, it has to 
be compared with a reference. We use as such the cubic 
version of the code^, which has been shown to yield very 
accurate results 29 . Thus, by demonstrating that the lin¬ 
ear version of BigDFT is able to reproduce the results of 
the cubic scaling version, we can be sure to have reached 
a very high level of precision. All calculations were done 
using free boundary conditions and the PBE 32 functional 
unless otherwise stated. 


A. Comparison of energies and forces 

One of the most important features of our code is the 
ability to yield a high level of precision without the need 
for doing a tedious fine tuning of the parameters. As an 
illustration of this property, we show the energies and 
forces calculated with both the linear and cubic scaling 
version for six systems with rather different electronic 
structures; they are depicted in Fig. [2] The chosen pa¬ 
rameters were very similar for all systems, with the main 
difference being the various cutoff radii used for the sup¬ 
port function regions of the different atom kinds. To keep 
it simple, the choice was just based on the row in which 
the element appears in the periodic table: 5.5bohr for 
elements of the first row, 6.3bohr for those of the second 
row, and 7.0 bohr for those of the third row. The kernel 
cutoff was set to 9.0 bohr; an automatic adjustment for 
technical reasons was done by the code if the value was 
smaller than the cutoff radius of the support function 
plus 8 times the grid spacing, see Ref. [26] for details. The 
number of support functions per atom was 1 for H, 9 for 
Si, and 4 for all the other elements. These parameters 



FIG. 2. Illustration of the six systems used for the compari¬ 
son of the linear and cubic energies and forces: Vitamin B 12 , 
Chlorophyll, Si 87 H 76 , C 60 , (H 2 O) 100 , DNA. 

seem to be fairly universal, meaning that any other sys¬ 
tem can be readily calculated without the need for any 
adjustments. 

The results are summarized in Tab.[IJ As can be seen, it 
is possible to get with these input parameters a fairly con¬ 
stant energy offset between the linear and cubic versions 
of about lOmeV/atom. The difference of the force norm 
is typically of the order of 10 -3 hartree/bohr. These are 
more than reasonable values in view of the fact that they 
were obtained without any fine tuning. Last but not 
least, it is worth noting that support functions are op¬ 
timized such that Pulay forces 33 are absent in our ap¬ 
proach 26 !, and the evaluation of the forces is therefore 
straightforward. 


B. Calculations of energy differences 

Absolute energies are always somewhat arbitrary as 
they are only defined up to an additive constant. There¬ 
fore energy differences are more meaningful as this am¬ 
biguity vanishes. For the linear scaling version there is 
in addition the benefit that the offset caused by the fi¬ 
nite cutoff radii cancels to a large degree and it is thus 
possible to get very close to the exact result. The value 
of lOmeV/atom mentioned in the previous section is the 
error in the absolute energy and can thus be considered 
as an upper bound; in fact it is however possible to ob¬ 
tain a much higher accuracy than one might think at first 
sight. 
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cubic 

linear 


difference 


energy 

force norm 

energy 

force 

norm 

energy 

force norm 


hartree 

hartree/bohr 

hartree 

hartree/bohr 

meV/atom hartree/bohr 

Vitamin B 12 

-926.78 

2.13-10“ 3 

-926.71 

1.93 

• 10“ 2 

11.43 

1.72- 10 -2 

Chlorophyll 

-476.70 

3.05 • 10 -3 

-476.64 

1.24 

• 10“ 2 

12.33 

9.38 • 10 -3 

Si 87 H 76 

-386.79 

7.80 • 10 -4 

-386.69 

1.01 

• 10“ 2 

17.20 

9.27- 10 -3 

^60 

-341.06 

2.69 • 10 -4 

-341.02 

7.36 

• 10“ 3 

17.23 

7.09 • 10“ 3 

(H 2 O) 100 

-1722.99 

5.23 • HP 1 

-1722.87 

5.26 

• 10 _1 

10.89 

2.80- 10 -3 

DNA 

-4483.12 

5.60 • 10 1 

-4482.84 

5.63 

• IQ- 1 

10.69 

2.40 • 10 -3 


TABLE I. Comparison of the energies and forces for different systems, calculated using the linear and cubic versions. Linear 
energies have an offset of about lOmeV/atom with respect to the cubic version, and the deviations of the linear forces from 
their cubic counterparts is of the order of 10 -3 hartree/bohr. The first four systems were close to their geometrical ground 
state, whereas the last two were unrelaxed. All systems were calculated with similar parameters. 


pure 

impurity 

difference 

eV 

eV 

eV 


cubic 

-53179.7148 -53251.5162 

71.8014 

linear 

-53168.4248 -53240.3371 

71.9123 

difference 

11.2900 11.1791 

0.1109 

relative difference 

- 

0.15% 


TABLE II. Energy differences between the pure and the im¬ 
pure Si wire, for both the linear and the cubic version. The 
energy offset between the linear and the cubic version cancels 
to a large degree and thus yields a very accurate result for the 
energy difference. 


As an example, we calculated the energy difference be¬ 
tween a hydrogen-passivated silicon wire in its pure state 
and another one which contains a defect. As a defect we 
chose a substitutional atom; one of the silicon atoms was 
replaced by phosphorus. These systems have been used 
as a case study to determine the binding energy of impu¬ 
rities in semiconductors using charged DFT calculations, 
see Ref. [34] The two configurations are depicted in Fig. [ 3 ] 
Apart from the fact that silicon is a rather delicate sys¬ 
tem, this defect is even more challenging as it requires 
the comparison of two systems with a different number 
of electrons. The calculations were performed with the 
analogous parameters as those used for the benchmarks 
in Sec. EMI again demonstrating their universal char¬ 
acter. From the results in Tab. IQ it becomes clear that 
energy differences can indeed be calculated with a very 
high precision. Whereas the offset between the linear 
and cubic version in the absolute energy is about 11.2 eV 
- which is about 17meV/atom, in agreement with the 
results of Sec. Ill A -, the discrepancy between the two 
version becomes as little as 0.11 eV for the energy differ¬ 
ence between the pure and the impure wire. Comparing 
this mismatch with the correct value of 71.8 eV one gets 
a relative error of only 0.15%. 



FIG. 3. The two silicon wires (consisting of 660 atoms) which 
are used to benchmark the calculation of energy differences. 
Whereas the one on the left side is a pure wire, there is a defect 
in the one on the right side with one silicon atom having been 
replaced by a phosphorus atom. The red circle highlights this 
substitutional atom. 


C. Geometry optimizations 


A good test to check at the same time the accuracy 
of the energies and the forces is to perform a short ge¬ 
ometry optimization for both the linear and the cubic 
version, starting from the same non-equilibrated struc¬ 
ture. If the forces calculated by the linear version are 
accurate enough, this should lead to identical trajecto¬ 
ries and thus to a parallel evolution of the energies and 
the forces. As an example we took an alkane consisting 
of 302 atoms. The results are shown in Fig. [4] As can 
be seen, the offset in the energy between the linear and 
the cubic version remains more or less constant through¬ 
out the entire geometry optimization, and is again of 
the order of lOmeV/atom. In addition, the forces are 
pretty much identical, with variations of the order of 
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FIG. 4. Evaluation of the energy (upper panel) and the forces 
(lower panel) as a function of the number of iterations in the 
geometry optimization. For the energy, the offset remains 
rather constant throughout the entire run, leading to two 
parallel curves; as can be seen from the inset, the offset only 
varies within less than 1 mev/atom. The curves for the forces 
are more or less superposed, with a difference of the order of 
10 -4 hartree/bohr. The large plateau in the middle where the 
force remains constant is due to the fact that an adjustment 
of the bond lengths can only start at the two endpoints of 
the alkane and then slowly propagate like a wave towards the 
center of the molecule. 

only 10 -4 hartree/bohr. Together this shows that the 
two trajectories performed by the linear and the cubic 
versions are identical, demonstrating the high precision 
of the forces calculated by the linear version. Again it 
must be stressed that these results were obtained with 
the same standard set of parameters which had been used 
in the previous examples. 


D. Energetic ordering 

The statement that our approach is universally appli¬ 
cable does not mean that it can blindly be applied to any 
system. There are situations where the “noise” intro¬ 
duced by the finite cutoff radii is larger than the “signal” 
one is looking for; in this case the results of our approach 
- as of any method relying on such a truncation - must 
be interpreted with care. By universally applicable we 
rather mean that we can straightforwardly - i.e. without 
the need of doing a tedious fine tuning of the input pa¬ 
rameters and the basis set - apply our method to those 
cases where the “signal to noise ratio” is large enough 
even when finite cutoff radii are used. 

As an example for a system where the linear scaling 



FIG. 5. The 12 energetically lowest configurations for the 
fullerenes B 12 C 48 . 


version is not able to reproduce the correct results of the 
cubic version we present the energetic ordering for the 12 
energetically lowest structured^ of B 12 C 48 , depicted in 
Fig-! Such boron-carbon fullerenes are a very delicate 
system; previous studies 36 37 have even led to different 
conclusions. In order to successfully describe the ener¬ 
getic ordering of these configurations it is indispensable 
to catch energy differences - i.e. signals - of the order of 
1 meV/atom, which is possible with the cubic version of 
BigDFT used in Ref. [35, The energy spectra for these 
12 configurations are shown in Fig. [6j for three different 
setups: the cubic version using the PBE functional, the 
linear version using as well the PBE functional, and the 
cubic version using the LDA^functional. As can be seen, 
there are - even if the main features of the system are 
well captured - notable differences between the cubic and 
the linear version: whereas the ground state is the same, 
there has been some reordering of the excited states; in 
particular the energetic gap between the ground state 
and the first local minimum is wrong. This shows that 
for this system the linear scaling approach using finite 
cutoff radii is not appropriate. However the third panel 
demonstrates that these boron-carbon fullerenes are in¬ 
deed an extremely delicate system. As can be seen, the 
energetic ordering is also considerably modified if the cu¬ 
bic version is used with another functional: using LDA 
instead of PBE, the energy levels of the ground state 
and the first excited one remain identical, but the higher 
levels are completely jumbled. Due to this very high sen¬ 
sitivity it is thus not surprising that the linear version 
is not able to satisfactorily reproduce the results of the 
cubic version, as the noise introduced by the finite cutoff 
radii is higher than the signal we look for. 

IV. SCALING WITH SYSTEM SIZE 

The goal of this work was to obtain a code which at the 
same time yields a very high precision and scales linearly. 
The first property has already been shown in detail in 
Sec. |III| now it remains to demonstrate the second prop¬ 
erty. To this end we performed single point calculations 
for randomly generated water droplets of various size, for 
both the linear and the cubic version. The runs were per¬ 
formed in parallel, using in total 6400 cores. The results 
for the total runtime and the memory peak are shown 
in Fig. [7| As can be seen both quantities clearly exhibit 
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cubic, PBE linear, PBE cubic, LDA 

FIG. 6. Energetic ordering of the lowest configurations of 
B 12 C 48 . From left to right: cubic version with PBE, linear 
version with PBE, cubic version with LDA. All energy levels 
have been shifted such that the ground state is always at zero. 
The colors correspond always to the same level, demonstrat¬ 
ing that the energetic ordering is not always the same. 


a strict linear scaling. The cubic version, on the other 
hand, shows a much steeper increase and does not per¬ 
mit calculations beyond about 2000 atoms. Due to the 
spherical geometry the degree of sparsity of the matrices 
is rather low and it is therefore more difficult for the lin¬ 
ear scaling version to exploit this property. Nevertheless, 
even for the smallest droplet the linear scaling version is 
faster than the cubic one. By extrapolating the curves 
for the runtime the crossover point for this system can be 
estimated to be at about 200 atoms, which is in a range 
still easily accessible by the cubic version of BigDFT for 
production runs (see e.g. Ref. 133 . 

Moreover it becomes clear that the computational re¬ 
sources consumed by the linear version are only moder¬ 
ate. For a complete single point calculation for 10000 
atoms only a few thousand CPU hours are required, 
which is a rather small amount considering the capac¬ 
ities of current supercomputers. This demonstrates that 
we were able to drastically reduce the number of degrees 
of freedom while keeping a very high level of accuracy. 

For a truly linear scaling regime it is also interesting to 
look at the ”CPU minutes per atom” which are needed 
to perform a fully self-consistent calculation. This quan- 



2000 4000 6000 8000 10000 12000 14000 

number of atoms 

FIG. 7. Total runtime for one single point calculation (upper 
panel) and memory peak per MPI task (lower panel) for water 
droplets of various size, ranging from 600 to 15000 atoms. 
The linear scaling version indeed exhibits linear scaling, as 
indicated by the linear extrapolations. The small deviations 
are mainly caused by a slightly different number of iterations 
required to reach convergence. 



number of atoms 

FIG. 8. CPU minutes per atom needed for a fully self- 
consistent calculation for DNA fragments and water droplets 
of different sizes. As expected, this quantity remains constant 
above a certain size, with the prefactor mainly influenced by 
the geometry of the system to be calculated. 






































tity is shown in Figj8]for DNA fragments and the water 
droplets used for the aforementioned benchmark. As is 
to be expected for a linear scaling code, this quantity 
remains constant above a certain critical size where the 
linear scaling parts of the code start to dominate. More¬ 
over one can easily see how the geometry of the system 
influences the prefactor, resulting in higher values for the 
water droplets than for the DNA fragments. 


V. PARALLELIZATION 
A. Parallel efficiency 

Even if the linear scaling version of BigDFT requires 
only moderate resources, it is indispensable to have an 
efficient parallelization scheme in order to keep the run¬ 
times short and thus to have the possibility of perform¬ 
ing advanced calculations such as geometry optimizations 
within a reasonable time frame. To this end we paral¬ 
lelized our code with a mixed distributed/shared mem¬ 
ory parallelization scheme using MPI and OpenMP. It 
is worth noting that the shared memory parallelization 
is not just an additional speedup. Reducing the num¬ 
ber of MPI tasks and in turn increasing the number of 
OpenMP threads helps to improve the MPI load balanc¬ 
ing and to reduce communication overhead; moreover it 
can substantially reduce the memory peak per compute 
node and thus help in situations where this resource is 
critical. 

Reaching an efficient parallelization for a linear scaling 
code is not easy. First of all we have a small number of 
degrees of freedom and thus little workload than can be 
shared among the cores. Moreover there are also load 
balancing problems which can arise due to the different 
sizes and surroundings of the localization regions. Nev¬ 
ertheless we were able to reach a degree of parallelism 
which is more than sufficient to efficiently perform large 
calculations. As an illustration we show the scaling for a 
fully self-consistent calculation of a DNA fragment con¬ 
sisting of 14300 atoms. The number of MPI tasks ranged 
from 160 to 3200, each one being split into 8 OpenMP 
threads; thus in total we have a range of 1280 to 25600 
cores. The results for the runtime and the memory peak 
are shown in Fig. [9j For the runtime we show the CPU 
minutes per atom as well as the total walltime. 

As can be seen from the CPU minutes per atoms re¬ 
maining almost constant the speedup is very good up to a 
few thousand cores. On the other hand this quantity in¬ 
creases for very large number of cores, revealing that the 
speedup becomes only moderate in this range. However 
this does not mean that the code is poorly parallelized. 
The reason is rather that, despite of the considerable 
size of the system, the number of degrees of freedom is 
so small that using too many cores simply results in a 
very small number of operations to be executed by a sin¬ 
gle core and thus to a poor ratio of computation and 
communication. Moreover it is more difficult to reach an 




FIG. 9. CPU minutes per atom and total walltime (upper 
panel) and memory peak (lower panel) as a function of the 
total number of cores for a DNA fragment consisting of 14300 
atoms. The runs were all performed using 8 OpenMP threads 
for each MPI task, thus only the latter number was varied. 


efficient load balancing in such a situation, resulting in 
many cores being idle most of the time. Consequently 
this degradation in the parallel speedup will be shifted 
towards larger values if one increases the system size. 

Nevertheless it is worth noting that even for the largest 
number of cores there is still some speedup, as can be 
seen from the total walltime decreasing steadily. Even in 
this range, the CPU time per atom remains of the same 
order as the one needed at the crossover point with the 
cubic code. As calculations with the cubic BigDFT code 
are already accessible in this range (as pointed out in the 
previous section), production runs of very large systems 
might become feasible by linearly scaling the computing 
resources needed at the crossover point. The memory 
parallelization, shown in the lower panel, does not suffer 
from any degradation, and we come close to a perfect 
scaling. In summary these results show that our code 
has an excellent level of parallelism as long as one keeps 
a good balance between the size of the system and the 
computational resources that are utilized. 
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B. Parallelization of the sparse matrices 


Some strategies to obtain an efficient parallelization 
have already been outlined in Ref. 26} Here we present 
a new concept which we developed to reach an efficient 
handling of the sparse matrices. 

The problem of the transposed approactP^ for calcu¬ 
lating the overlap matrix (and similarly the Hamiltonian 
matrix) is that it requires a global reduction operation 
at the end. This can pose a bottleneck both from the 
viewpoint of runtime and memory. Moreover this global 
reduction is wasteful due to the locality which is inher¬ 
ent to the linear scaling approach. Each MPI task only 
needs to know a small portion of the entire matrices and 
a global reduction is thus not necessary. To circumvent 
this issue, the MPI tasks are regrouped in taskgroups, 
sharing a portion of the entire matrix - obviously this 
portion is chosen such that it contains those parts of the 
global matrix which are actually needed by each indi¬ 
vidual task. In addition the taskgroups are defined such 
that each MPI task belongs to at most two taskgroups. 
This has two advantages: first of all each MPI task only 
needs to hold a copy of a part of the global matrix, thus 
reducing the memory requirements, and secondly the re¬ 
duction only needs to be performed within a taskgroup. 
Thanks to the use of non-blocking collective MPI rou¬ 
tines a task can participate in two reduction operations 
at the same time without the risk of serializing the code. 

The concept is visualized in Fig. [10] for a toy system 
consisting of 8 MPI tasks. The matrix subparts which 
are needed by each MPI task are indicated by rectangles. 
The MPI tasks are initially regrouped in 3 taskgroups: 
taskgroup I is responsible for the part needed by the tasks 

1- 2, taskgroup II for those needed by the tasks 3-6, and 
taskgroup III for those needed by the tasks 7-8. The 
taskgroups are then enlarged in order to guarantee that 
the reduction is done correctly. Therefore taskgroup I 
includes the tasks 1-4, taskgroups II includes the tasks 

2- 8, and taskgroup III includes the tasks 5-8. 

Obviously the toy system is too small to get a real 

benefit from the taskgroups. However for large systems 
there can be a substantial gain. As an example we show 
in Tab. m the timings obtained with and without the 
matrix taskgroups for two specific operations. As can be 
seen, the calculation of the overlap and Hamiltonian ma¬ 
trix (both computation and communication) can be ac¬ 
celerated by more than a factor of 3, demonstrating that 
actually most of the time was spent in the reduction and 
not in the computation itself. The other operation, which 
is required at the end of the FOE procedure to build up 
the density kernel out of the partial results calculated by 
each task and is almost entirely communication based, 
can even benefit from a speedup of 16. 



□T 
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FIG. 10. Illustration of the concept of the matrix taskgroups 
for a toy system. There are 8 MPI tasks which are regrouped 
in 3 taskgroups. Each taskgroup is initially defined such as to 
form a matrix subset, and subsequently extended to include 
all the processes which share data of the initial subset. Care 
is taken in not including each MPI task in more than two 
different taskgroups. 



overlap calculation gather/compress 


seconds 

seconds 

without taskgroups 

88.9 

266.9 

with taskgroups 

25.2 

16.4 


TABLE III. Speedups offered by the matrix taskgroups for 
two specific operations. 

VI. CONCLUSION 

Using a set of strictly localized and quasi-orthogonal 
support functions which are adapted in-situ to their 
chemical environment we were able to develop a code 
which can perform accurate DFT calculations with a 
strict linear scaling with respect to the size of the system. 
Thanks to the compact support property of the underly¬ 
ing Daubechies wavelets, this reduction of the complexity 
with respect to the traditional cubic approach does not 
come at the cost of a loss of precision. Indeed, we are 
able to get an excellent level of accuracy with a set of 
standard input parameters and can thus get rid of the 
tedious fine tuning which is often needed for other 0(N ) 
approaches. We believe this would considerably simplify 
the usage of this code by end-users, as this fine-tuning 
usually restricts the usage G(N) approaches to a com¬ 
munity of specialists. 

Moreover we were able to drastically cut down the 
degrees of freedom needed for an accurate calculation, 
thus reducing the prefactor of the scaling law and con¬ 
sequently considerably diminishing the amount of com¬ 
putational resources required. Even for systems con¬ 
taining 10000 atoms a complete single point calculation 
can be done using only of the order of one thousand 
CPU hours. Furthermore the linear scaling behavior of 
the program motivates the notion of “CPU minutes per 
atom”, making it easy to estimate the computational 
resources needed for the simulation of a given system 
of any size. Together with the efficient parallelization 
scheme this low consumption of computational resources 
also paves the way towards more expensive tasks such as 
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geometry optimizations. 
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